Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over other approaches are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences.
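As a rough illustration of how a word translation oracle and unlabeled documents from both languages could be combined, the sketch below pairs candidate source words with their translations and keeps only pairs that occur often enough in both unlabeled collections. The function name `select_pivots`, the dictionary-based oracle, the `min_support` threshold, and the toy documents are assumptions made for this example; the abstract does not specify these details, and the later steps of SCL (learning correspondences from such pairs) are not shown.

```python
from collections import Counter


def select_pivots(candidates, oracle, source_docs, target_docs, min_support=2):
    """Pair each candidate source word with its translation and keep the
    pairs that appear in at least `min_support` unlabeled documents on
    both sides. All names and the threshold are hypothetical."""
    # Document frequency: count each word once per document.
    src_df = Counter(w for doc in source_docs for w in set(doc))
    tgt_df = Counter(w for doc in target_docs for w in set(doc))
    pivots = []
    for word in candidates:
        translation = oracle.get(word)  # the oracle is modeled as a plain dict
        if translation is None:
            continue
        if src_df[word] >= min_support and tgt_df[translation] >= min_support:
            pivots.append((word, translation))
    return pivots


# Toy unlabeled data (invented for illustration only).
source_docs = [["good", "book"], ["good", "plot"], ["boring", "book"]]
target_docs = [["gut", "buch"], ["gut", "film"], ["langweilig"]]
oracle = {"good": "gut", "boring": "langweilig", "plot": "handlung"}

pivots = select_pivots(["good", "boring", "plot"], oracle, source_docs, target_docs)
print(pivots)
```

Only the pair that is well supported in both languages survives the filter; rarer words are dropped because too few unlabeled documents contain them.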
The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants
Reasoning is a crucial part of natural language argumentation. To comprehend
an argument, one must analyze its warrant, which explains why its claim follows
from its premises. As arguments are highly contextualized, warrants are usually
presupposed and left implicit. Thus, the comprehension does not only require
language understanding and logic skills, but also depends on common sense. In
this paper we develop a methodology for reconstructing warrants systematically.
We operationalize it in a scalable crowdsourcing process, resulting in a freely
licensed dataset with warrants for 2k authentic arguments from news comments.
On this basis, we present a new challenging task, the argument reasoning
comprehension task. Given an argument with a claim and a premise, the goal is
to choose the correct implicit warrant from two options. Both warrants are
plausible and lexically close, but lead to contradicting claims. A solution to
this task will define a substantial step towards automatic warrant
reconstruction. However, experiments with several neural attention and language
models reveal that current approaches do not suffice.
Comment: Accepted as NAACL 2018 Long Paper; see details on the front page
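The task format described above (a claim, a premise, and two candidate warrants, one of which is correct) can be sketched as a small data structure with an accuracy function. The class name `WarrantInstance`, its field names, and the two toy instances are assumptions made for this illustration; they are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarrantInstance:
    """One task instance; field names are hypothetical."""
    claim: str
    premise: str
    warrant0: str
    warrant1: str
    correct: int  # index (0 or 1) of the warrant that supports the claim


def accuracy(instances, choose):
    """Fraction of instances for which `choose` picks the correct warrant."""
    hits = sum(1 for inst in instances if choose(inst) == inst.correct)
    return hits / len(instances)


# Invented toy instances, for illustration only.
toy = [
    WarrantInstance(
        claim="The library should open on Sundays",
        premise="Many students only have time to study on weekends",
        warrant0="Students would use the library on Sundays",
        warrant1="Students would not use the library on Sundays",
        correct=0,
    ),
    WarrantInstance(
        claim="The town should not build a new car park",
        premise="The existing car park is rarely full",
        warrant0="Demand for parking will grow sharply soon",
        warrant1="Demand for parking will stay about the same",
        correct=1,
    ),
]


def always_first(instance):
    """Trivial baseline that always selects the first warrant."""
    return 0


print(accuracy(toy, always_first))
```

Because the two warrants are lexically close but lead to opposite claims, a baseline that ignores the content, like `always_first`, can only match the label distribution.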
TIR 2015 Workshop Preface
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
Retrieval Models for Genre Classification
Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means for genre classification is an open and controversially discussed issue. This paper gives an overview and presents new results related to automatic genre classification of text documents. We present a comprehensive survey which contrasts the genre retrieval models that have been developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies the paper provides an original contribution related to computational aspects and classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing utilization of vocabulary-centered, genre-revealing features and open new possibilities for the construction of genre classifiers that operate in real time.
A keyquery-based classification system for CORE
We apply keyquery-based taxonomy composition to compute a classification system for the CORE dataset, a shared crawl of about 850,000 scientific papers. Keyquery-based taxonomy composition can be understood as a two-phase hierarchical document clustering technique that utilizes search queries as cluster labels: In a first phase, the document collection is indexed by a reference search engine, and the documents are tagged with the search queries they are relevant for, their so-called keyqueries. In a second phase, a hierarchical clustering is formed from the keyqueries within an iterative process. We use the explicit topic model ESA as document retrieval model in order to index the CORE dataset in the reference search engine. Under the ESA retrieval model, documents are represented as vectors of similarities to Wikipedia articles; a methodology proven to be advantageous for text categorization tasks. Our paper presents the generated taxonomy and reports on quantitative properties such as document coverage and processing requirements.
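The first phase described above, tagging each document with the queries it is relevant for, can be sketched with a toy search engine. Everything here is a simplifying assumption for illustration: the Boolean `search` function stands in for the reference search engine (the paper uses an ESA-based retrieval model), "relevant" is approximated as "returned among the top `k` results", and candidate queries are enumerated from a small vocabulary.

```python
from itertools import combinations


def search(query, docs, k):
    """Hypothetical stand-in for the reference search engine: return up to
    k document ids (in id order) whose word sets contain all query terms."""
    hits = [doc_id for doc_id, words in sorted(docs.items()) if set(query) <= words]
    return hits[:k]


def keyqueries(docs, vocabulary, k=2, max_len=2):
    """Tag every document with the queries that retrieve it in the top k."""
    tags = {doc_id: [] for doc_id in docs}
    for n in range(1, max_len + 1):
        for query in combinations(sorted(vocabulary), n):
            for doc_id in search(query, docs, k):
                tags[doc_id].append(query)
    return tags


# Toy collection (invented for illustration only).
docs = {
    "d1": {"neural", "parsing"},
    "d2": {"neural", "retrieval"},
    "d3": {"retrieval", "index"},
}
tags = keyqueries(docs, {"neural", "retrieval"})
for doc_id, queries in tags.items():
    print(doc_id, queries)
```

Documents that share a keyquery, such as `d1` and `d2` under `("neural",)`, are natural candidates for a common cluster labeled by that query, which is the intuition behind the second phase.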
Paraphrase Acquisition from Image Captions
We propose to use image captions from the Web as a previously underutilized
resource for paraphrases (i.e., texts with the same "message") and to create
and analyze a corresponding dataset. When an image is reused on the Web, an
original caption is often assigned. We hypothesize that different captions for
the same image naturally form a set of mutual paraphrases. To demonstrate the
suitability of this idea, we analyze captions in the English Wikipedia, where
editors frequently relabel the same image for different articles. The paper
introduces the underlying mining technology, the resulting Wikipedia-IPC
dataset, and compares known paraphrase corpora with respect to their syntactic
and semantic paraphrase similarity to our new resource. In this context, we
introduce characteristic maps along the two similarity dimensions to identify
the style of paraphrases coming from different sources. An annotation study
demonstrates the high reliability of the algorithmically determined
characteristic maps.
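The core hypothesis, that different captions of the same image form a set of mutual paraphrases, can be illustrated by grouping captions by image and enumerating caption pairs. This is a minimal sketch under assumed inputs: the `(image_id, caption)` record format, the function name, and the toy captions are invented here and do not reflect the paper's actual mining pipeline or the Wikipedia-IPC data.

```python
from collections import defaultdict
from itertools import combinations


def caption_paraphrase_candidates(records):
    """Group captions by image and return every pair of distinct captions
    of the same image as a paraphrase candidate."""
    by_image = defaultdict(set)
    for image_id, caption in records:
        by_image[image_id].add(caption.strip())  # a set drops verbatim duplicates
    pairs = []
    for image_id in sorted(by_image):
        for first, second in combinations(sorted(by_image[image_id]), 2):
            pairs.append((image_id, first, second))
    return pairs


# Toy records (invented for illustration only).
records = [
    ("img1", "A red fox in snow"),
    ("img1", "Red fox standing in the snow"),
    ("img1", "A red fox in snow"),
    ("img2", "Old town hall"),
]
pairs = caption_paraphrase_candidates(records)
print(pairs)
```

Images with a single distinct caption yield no candidates, so only reused and relabeled images contribute pairs.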